Abstract

The  original  and  most  widely  studied  PAC  model  for  learning  assumes  a  passive  learner  in  the 
sense  that  the  learner  plays  no  role  in  obtaining  information  about  the  unknown  concept.  That  is, 
the  samples  are  simply  drawn  independently  from  some  probability  distribution.  Some  work  has 
been  done  on  studying  more  powerful  oracles  and  how  they  affect  learnability.  To  find  bounds  on 
the  improvement  that  can  be  expected  from  using  oracles,  we  consider  active  learning  in  the  sense 
that  the  learner  has  complete  choice  in  the  information  received.  Specifically,  we  allow  the  learner 
to  ask  arbitrary  yes/no  questions.  We  consider  both  active  learning  under  a  fixed  distribution 
and  distribution-free  active  learning.  In  the  case  of  active  learning,  the  underlying  probability 
distribution  is  used  only  to  measure  distance  between  concepts.  For  learnability  with  respect  to 
a  fixed  distribution,  active  learning  does  not  enlarge  the  set  of  learnable  concept  classes,  but  can 
improve  the  sample  complexity.  For  distribution-free  learning,  it  is  shown  that  a  concept  class 
is  actively  learnable  iff  it  is  finite,  so  that  active  learning  is  in  fact  less  powerful  than  the  usual 
passive  learning  model.  We  also  consider  a  form  of  distribution-free  learning  in  which  the  learner 
knows  the  distribution  being  used,  so  that  ‘distribution-free’  refers  only  to  the  requirement  that  a 
bound  on  the  number  of  queries  can  be  obtained  uniformly  over  all  distributions.  Even  with  the 
side  information  of  the  distribution  being  used,  a  concept  class  is  actively  learnable  iff  it  has  finite 
VC dimension, so that active learning with the side information still does not enlarge the set of
learnable  concept  classes. 
... Department of the Navy under Air Force Contract F19628-90-C-0002, and by the National Science Foundation under
contract ECS-8552419.
1  Introduction


The  PAC  learning  model  introduced  by  Valiant  [20]  provides  a  framework  for  studying  the 
problem  of  learning  from  examples.  In  this  model,  the  learner  attempts  to  approximate  a  concept 
unknown  to  him  from  a  set  of  positive  and  negative  examples  of  the  concept.  The  examples  are 
drawn from some unknown probability distribution, and the same distribution is used to measure
the  distance  between  concepts.  After  some  finite  number  of  examples,  the  learner  is  required  only 
to  output  with  high  probability  a  hypothesis  close  to  the  true  concept.  A  collection  of  concepts, 
called  a  concept  class,  is  said  to  be  learnable  if  a  bound  on  the  number  of  examples  needed  to 
achieve  a  certain  accuracy  and  confidence  in  the  hypothesis  can  be  obtained  uniformly  over  all 
concepts in the concept class and all underlying probability distributions.

One  goal  of  studying  such  a  formal  framework  is  to  be  able  to  characterize  in  a  precise  sense  the 
tractability  of  learning  problems.  For  the  PAC  model,  Blumer  et  al.  [6]  showed  that  a  concept  class 
is learnable iff it has finite VC dimension, and they provided upper and lower bounds on the number
of examples needed in this case. The requirement that a concept class have finite VC dimension is
quite restrictive. There are many concept classes of practical interest with infinite VC dimension
that one would like to be learnable, or that one feels should be learnable. In fact, even some concept classes of
interest  in  low  dimensional  Euclidean  spaces  are  not  learnable.  For  applications  such  as  image 
analysis,  machine  vision,  and  system  identification,  the  concepts  might  be  subsets  of  some  infinite 
dimensional  function  space  and  the  concept  classes  generally  will  not  have  finite  VC  dimension. 
Hence,  for  many  applications  the  original  PAC  model  is  too  restrictive  in  the  sense  that  not  enough 
problems  are  learnable  in  this  framework. 

A  natural  direction  to  pursue  is  to  consider  extensions  or  modifications  of  the  original  framework 
which  enlarge  the  set  of  learnable  concept  classes.  Two  general  approaches  are  to  relax  the  learning 
requirements  and  to  increase  the  power  of  the  learner-environment  or  learner-teacher  interactions. 
A  considerable  amount  of  work  has  been  done  along  these  lines.  For  example,  learnability  with 
respect  to  a  class  of  distributions  (as  opposed  to  the  original  distribution-free  framework)  has  been 
studied  [5,  13,  14,  15].  Notably,  Benedek  and  Itai  [5]  first  studied  learnability  with  respect  to  a 
fixed  and  known  probability  distribution,  and  characterized  learnability  in  this  case  in  terms  of 
the  metric  entropy  of  the  concept  class.  Others  have  considered  particular  instances  of  learnability 
with  respect  to  a  fixed  distribution.  Regarding  the  learner-environment  interactions,  in  the  original 
model  the  examples  provided  to  the  learner  are  obtained  from  some  probability  distribution  which 
the  learner  has  no  control  over.  In  this  sense,  the  model  assumes  a  purely  passive  learner.  There 
has  been  quite  a  bit  of  work  done  on  increasing  the  power  of  the  learner’s  information  gathering 
mechanism.  For  example,  Angluin  [3,  4]  has  studied  a  variety  of  oracles  and  their  effect  on  learning, 
Amsterdam  [1]  considered  a  model  which  gives  the  learner  some  control  over  the  choice  of  examples 
by  allowing  the  learner  to  focus  attention  on  some  chosen  region  of  the  instance  space,  and  Eisenberg 
and  Rivest  [8]  studied  the  effect  on  sample  complexity  of  allowing  membership  queries  in  addition 
to  random  examples. 

In  this  paper,  we  study  the  limits  of  what  can  be  gained  by  allowing  the  most  general  set  of 
binary  valued  learner-environment  interactions,  which  give  the  learner  complete  control  over  the 
information  gathering.  Specifically,  we  consider  completely  ‘active’  learning  in  that  the  the  learner 
is  allowed  to  ask  arbitrary  yes/no  (i.e.,  binary  valued)  questions,  and  these  questions  need  not  be 
decided  on  beforehand.  That  is,  the  questions  the  learner  asks  can  depend  on  previous  answers  and 
can also be generated randomly. Many of the oracles previously considered in the literature are
simply  particular  types  of  yes/no  questions  (although  those  oracles  that  provide  counterexamples 
are  not).  Both  active  learning  with  respect  to  a  fixed  distribution  and  distribution-free  active 
learning  are  considered.  Since  we  are  concerned  with  active  learning,  the  probability  distribution 
is  not  used  to  generate  the  examples,  but  is  used  only  to  measure  the  distance  between  concepts. 

Definitions  of  passive  and  active  learning  are  provided  in  Section  2.  In  Section  3,  active  learning 
with  respect  to  a  fixed  distribution  is  considered.  A  simple  information  theoretic  argument  shows 
that  active  learning  does  not  enlarge  the  set  of  learnable  concept  classes,  but  as  expected  can  reduce 
the  sample  complexity  of  learning.  In  Section  4,  distribution-free  active  learning  is  considered.  In 
this case, active learning can take place only in the degenerate situation of a finite concept class.
We  also  consider  a  form  of  distribution-free  learning  in  which  we  assume  that  the  learner  knows 
the  distribution  being  used,  so  that  ‘distribution-free’  refers  only  to  the  requirement  that  a  bound 
can  be  obtained  on  the  number  of  yes/no  questions  required  independent  of  the  distribution  used 
to  measure  distance  between  concepts.  However,  even  in  this  case  active  learning  surprisingly  does 
not  enlarge  the  set  of  learnable  concept  classes,  but  does  reduce  the  sample  complexity  as  expected. 

2  Definitions  of  Passive  and  Active  Learnability 

The definitions below follow closely the notation of [6]. Let $X$ be a set which is assumed to be
fixed and known. $X$ is sometimes called the instance space. Typically, $X$ is taken to be either
$\mathbb{R}^n$ (especially $\mathbb{R}^2$) or the set of binary $n$-vectors. A concept is a subset of $X$, and a collection of
concepts $C \subseteq 2^X$ will be called a concept class. An element $x \in X$ will be called a sample, and a pair
$(x, a)$ with $x \in X$ and $a \in \{0, 1\}$ will be called a labeled sample. Likewise, $x = (x_1, \ldots, x_m) \in X^m$
is called an $m$-sample, and a labeled $m$-sample is an $m$-tuple $((x_1, a_1), \ldots, (x_m, a_m))$ where $a_i = a_j$
if $x_i = x_j$. For $x = (x_1, \ldots, x_m) \in X^m$ and $c \in C$, the labeled $m$-sample of $c$ generated by $x$ is
given by $\mathrm{sam}_c(x) = ((x_1, I_c(x_1)), \ldots, (x_m, I_c(x_m)))$ where $I_c(\cdot)$ is the indicator function for the set
$c$. The sample space of $C$ is denoted by $S_C$ and consists of all labeled $m$-samples for all $c \in C$, all
$x \in X^m$, and all $m \ge 1$.

Let $H$ be a collection of subsets of $X$. $H$ is called the hypothesis class, and the elements of
$H$ are called hypotheses. Let $F_{C,H}$ be the set of all functions $f : S_C \to H$. Given a probability
distribution $P$ on $X$, the error of $f$ with respect to $P$ for a concept $c \in C$ and sample $x$ is defined
as $\mathrm{error}_{f,c,P}(x) = P(c \Delta h)$ where $h = f(\mathrm{sam}_c(x))$ and $c \Delta h$ denotes the symmetric difference of the
sets $c$ and $h$. Finally, in the definition of passive learnability to be given below, the samples used
in forming a hypothesis will be drawn from $X$ independently according to the same probability
measure $P$. Hence, an $m$-sample will be drawn from $X^m$ according to the product measure $P^m$.
We can now state the following definition of passive learnability for a class of distributions.

Definition 1 (Passive Learnability for a Class of Distributions) Let $\mathcal{P}$ be a fixed and known
collection of probability measures. The pair $(C, H)$ is said to be passively learnable with respect
to $\mathcal{P}$ if there exists a function $f \in F_{C,H}$ such that for every $\epsilon, \delta > 0$ there is a $0 < m(\epsilon, \delta) < \infty$
such that for every probability measure $P \in \mathcal{P}$ and every $c \in C$, if $x \in X^m$ is chosen at random
according to $P^m$ then the probability that $\mathrm{error}_{f,c,P}(x) < \epsilon$ is greater than $1 - \delta$.
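
To make the notation concrete, the following minimal Python sketch evaluates a labeled $m$-sample and the error $P(c \Delta h)$ on a small finite instance space. It is only an illustration under stated assumptions: concepts are represented as frozensets, $P$ as a dictionary of point masses, and the names `sam`, `sym_diff_prob`, `error`, and the toy hypothesis map `f` are hypothetical choices that do not appear in the definitions above.

```python
import random

def sam(c, xs):
    """Labeled m-sample of the concept c generated by the m-sample xs."""
    return tuple((x, 1 if x in c else 0) for x in xs)

def sym_diff_prob(P, c, h):
    """P(c Δ h): probability mass of the symmetric difference of c and h."""
    return sum(p for x, p in P.items() if (x in c) != (x in h))

def error(f, c, P, xs):
    """error_{f,c,P}(xs) = P(c Δ h) with h = f(sam_c(xs))."""
    return sym_diff_prob(P, c, f(sam(c, xs)))

# Toy setup (assumed for illustration): X = {0, 1, 2, 3} with uniform P.
P = {x: 0.25 for x in range(4)}
c = frozenset({0, 1})

def f(labeled):
    """Toy hypothesis map: output the set of points seen with label 1."""
    return frozenset(x for x, a in labeled if a == 1)

# Draw a 3-sample from the product measure P^3 and evaluate the error.
rng = random.Random(0)
xs = tuple(rng.choices(list(P), weights=list(P.values()), k=3))
print(sam(c, xs), error(f, c, P, xs))
```

In this toy setting the error depends on the sample drawn, which is why Definition 1 bounds the probability of a small error rather than the error itself.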

If $\mathcal{P}$ is the set of all probability distributions over some fixed $\sigma$-algebra of $X$ (which we will denote
by $\mathcal{P}^*$), then the above definition reduces to the version from Blumer et al. [6] of Valiant's [20]
original definition (without restrictions on computability) for learnability for all distributions. If $\mathcal{P}$
consists of a single distribution then the above definition reduces to that used by Benedek and Itai
[5]. As is often done in the literature, we will be considering the case $H = C$ throughout, so that we
will simply speak of learnability of $C$ rather than learnability of $(C, H)$.

By active learning we will mean that the learner is allowed to ask arbitrary yes/no questions. We
will consider only the case $H = C$ throughout, and so we define active learnability in this case only.
For a fixed distribution, the only object unknown to the learner is the chosen concept. In this case,
an arbitrary binary question provides information of the type $c \in C_0$ where $C_0$ is some subset of $C$.
That is, all binary questions can be reduced to partitioning $C$ into two pieces and asking to which
of the two pieces $c$ belongs. For distribution-free learning (or more generally, learning for a
class of distributions) the distribution $P$ is also unknown. In this case, every binary question can be
reduced to the form "Is $(c, P) \in q$?" where $q$ is an arbitrary subset of $C \times \mathcal{P}$, so that $C$ and $\mathcal{P}$ can be
simultaneously and dependently partitioned. Thus, the information the active learner obtains is of
the form $((q_1, a_1), \ldots, (q_m, a_m))$ where $q_i \subseteq C \times \mathcal{P}$ and $a_i = 1$ if $(c, P) \in q_i$ and $a_i = 0$ otherwise. The
$q_i$ correspond to the binary valued (i.e., yes/no) questions and $a_i$ denotes the answer to the question
$q_i$ when the true concept and probability measure are $c$ and $P$ respectively. In general, $q_i$ can be
generated randomly or deterministically and can depend on all previous questions and answers
$(q_1, a_1), \ldots, (q_{i-1}, a_{i-1})$. The $q_i$ are not allowed to depend explicitly on the true concept $c$ and
probability measure $P$, but can depend on them implicitly through answers to previous questions.
Let $q = (q_1, \ldots, q_m)$ denote a set of $m$ questions generated in such a manner, and let $\mathrm{sam}_{c,P}(q)$
denote the set of $m$ question and answer pairs when the true concept and probability measure are
$c$ and $P$ respectively. Let $S_{C,\mathcal{P}}$ denote all sets of $m$ question and answer pairs generated in such a
manner for all $c \in C$, $P \in \mathcal{P}$, and $m \ge 1$. By an active learning algorithm we mean an algorithm
$A$ for selecting $q_1, \ldots, q_m$ together with a mapping $f : S_{C,\mathcal{P}} \to C$ for generating a hypothesis from
$\mathrm{sam}_{c,P}(q)$. In general, $A$ and/or $f$ may be nondeterministic, which results in probabilistic active
learning algorithms. If both $A$ and $f$ are deterministic we have a deterministic active learning
algorithm. Note that if the distribution $P$ is known then with a probabilistic algorithm an active
learner can simulate the information received by a passive learner by simply generating random
examples and asking whether they are elements of the unknown concept.
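
The simulation of a passive learner mentioned in the last sentence can be sketched as follows. This is a minimal illustration, assuming a finitely supported known distribution given as a dictionary and an oracle that answers yes/no questions represented as predicates on concepts; the names `membership_question`, `simulate_passive_sample`, and `oracle` are hypothetical.

```python
import random

def membership_question(x):
    """The yes/no question 'is x in c?', i.e. the subset {c in C : x in c}."""
    return lambda c: x in c

def simulate_passive_sample(P, oracle, m, rng):
    """Draw m points from the known P and label each one with a single query."""
    points, weights = zip(*P.items())
    xs = rng.choices(points, weights=weights, k=m)
    return [(x, 1 if oracle(membership_question(x)) else 0) for x in xs]

# Toy setup (assumed): the hidden concept is visible only through the oracle.
true_concept = frozenset({1, 2})
oracle = lambda question: question(true_concept)
P = {0: 0.2, 1: 0.3, 2: 0.5}

labeled = simulate_passive_sample(P, oracle, 5, random.Random(1))
print(labeled)
```

Each simulated example costs exactly one binary question, so any passive sample size bound carries over as a bound on the number of queries when $P$ is known.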

Definition 2 (Active Learnability for a Class of Distributions) Let $\mathcal{P}$ be a fixed and known
collection of probability measures. $C$ is said to be actively learnable with respect to $\mathcal{P}$ if there exists
a function $f : S_{C,\mathcal{P}} \to C$ such that for every $\epsilon, \delta > 0$ there is a $0 < m(\epsilon, \delta) < \infty$ such that for every
probability measure $P \in \mathcal{P}$ and every $c \in C$, if $h = f(\mathrm{sam}_{c,P}(q))$ then the probability (with respect
to any randomness in $A$ and $f$) that $P(h \Delta c) < \epsilon$ is greater than $1 - \delta$.

3  Active  Learning  for  a  Fixed  Distribution 

In this section, we consider active learning with respect to a fixed and known probability distribution.
That is, $\mathcal{P}$ consists of a single distribution $P$ that is known to the learner. Benedek and Itai
[5] obtained conditions for passive learnability in this case in terms of a quantity known as metric
entropy.

Definition 3 (Metric Entropy) Let $(Y, \rho)$ be a metric space. Define $N(\epsilon) = N(\epsilon, Y, \rho)$ to be the
smallest integer $n$ such that there exists $y_1, \ldots, y_n \in Y$ with $Y = \bigcup_{i=1}^{n} B_\epsilon(y_i)$ where $B_\epsilon(y_i)$ is the
open ball of radius $\epsilon$ centered at $y_i$. If no such $n$ exists, then $N(\epsilon, Y, \rho) = \infty$. The metric entropy
of $Y$ (often called the $\epsilon$-entropy) is defined to be $\log_2 N(\epsilon)$.

$N(\epsilon)$ represents the smallest number of balls of radius $\epsilon$ which are required to cover $Y$. For another
interpretation, suppose we wish to approximate $Y$ by a finite set of points so that every element of
$Y$ is within $\epsilon$ of at least one member of the finite set. Then $N(\epsilon)$ is the smallest number of points
possible in such a finite approximation of $Y$. The notion of metric entropy for various metric spaces
has been studied and used by a number of authors (e.g. see [7, 12, 19]).

In the present application, the measure of error $d_P(c_1, c_2) = P(c_1 \Delta c_2)$ between two concepts
with respect to a distribution $P$ is a pseudo-metric. Note that $d_P(\cdot, \cdot)$ is generally only a
pseudo-metric since $c_1$ and $c_2$ may be unequal but may differ on a set of measure zero with respect to $P$.
For convenience, if $P$ is a distribution we will use the notation $N(\epsilon, C, P)$ (instead of $N(\epsilon, C, d_P)$),
and we will speak of the metric entropy of $C$ with respect to $P$, with the understanding that the
metric being used is $d_P(\cdot, \cdot)$.
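
For a finite concept class and a finitely supported distribution, both $d_P$ and a cover by open $\epsilon$-balls can be computed directly. The sketch below is an illustration only: the greedy procedure is a swapped-in heuristic (not a method from the text) that returns some $\epsilon$-cover, so its size is an upper bound on $N(\epsilon, C, P)$ and need not be the minimum. The names `d_P` and `greedy_cover` and the toy class of initial segments are assumptions.

```python
def d_P(P, c1, c2):
    """Pseudo-metric d_P(c1, c2) = P(c1 Δ c2) for a finitely supported P."""
    return sum(p for x, p in P.items() if (x in c1) != (x in c2))

def greedy_cover(concepts, P, eps):
    """Pick centers so that every concept lies in an open eps-ball of a center."""
    centers = []
    for c in concepts:
        # c becomes a new center only if no existing center is within eps of it.
        if all(d_P(P, c, y) >= eps for y in centers):
            centers.append(c)
    return centers

# Toy class (assumed): initial segments {0, ..., k-1} of X = {0, ..., 7}, uniform P.
P = {x: 1 / 8 for x in range(8)}
concepts = [frozenset(range(k)) for k in range(9)]
centers = greedy_cover(concepts, P, eps=0.25)
print(len(centers))
```

Because the chosen centers are pairwise at least $\epsilon$ apart, no open ball of radius $\epsilon/2$ contains two of them, so the greedy count lies between $N(\epsilon, C, P)$ and $N(\epsilon/2, C, P)$.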

Benedek  and  Itai  [5]  proved  that  a  concept  class  C  is  passively  learnable  for  a  fixed  distribution 
$P$ iff $C$ has finite metric entropy with respect to $P$, and they provided upper and lower bounds
on the number of samples required. Specifically, they showed that any passive learning algorithm
requires at least $\log_2 (1 - \delta) N(2\epsilon, C, P)$ samples and that $(32/\epsilon) \ln(N(\epsilon/2)/\delta)$ samples are sufficient.
The following result shows that the same condition of finite metric entropy is required in the case
of active learning. In active learning, the learner wants to encode the concept class to an accuracy $\epsilon$
with  a  binary  alphabet,  so  that  the  situation  is  essentially  an  elementary  problem  in  source  coding 
from  information  theory  [9].  However,  the  learner  wants  to  minimize  the  length  of  the  longest 
codeword  rather  than  the  mean  codeword  length. 

Theorem 1 A concept class $C$ is actively learnable with respect to a distribution $P$ iff $N(\epsilon, C, P) < \infty$
for all $\epsilon > 0$. Furthermore, $\lceil \log_2 (1 - \delta) N(2\epsilon, C, P) \rceil$ queries are necessary, and
$\lceil \log_2 (1 - \delta) N(\epsilon, C, P) \rceil$ queries are sufficient. For deterministic learning algorithms, $\lceil \log_2 N(\epsilon, C, P) \rceil$ queries
are both necessary and sufficient.

Proof: First consider $\delta = 0$. $\lceil \log_2 N(\epsilon, C, P) \rceil$ questions are sufficient since one can construct an
$\epsilon$-approximation to $C$ with $N(\epsilon, C, P)$ concepts, then ask $\lceil \log_2 N(\epsilon, C, P) \rceil$ questions to identify one
of these $N(\epsilon, C, P)$ concepts that is within $\epsilon$ of the true concept. $\lceil \log_2 N(\epsilon, C, P) \rceil$ questions are
necessary since by definition every $\epsilon$-approximation to $C$ has at least $N(\epsilon, C, P)$ elements. Hence,
with any fewer questions there is necessarily a concept in $C$ which is not $\epsilon$-close to any concept the
learner might output.
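
The sufficiency argument for $\delta = 0$ amounts to a binary search over an $\epsilon$-approximation. The sketch below illustrates this under stated assumptions: the cover is given as a list, a concept that is close to several cover elements is assigned to the first one (a tie-breaking rule chosen here, not taken from the text), and each question is the subset of $C$ consisting of concepts assigned to an index below a threshold. The names `first_close_index`, `active_learn`, and `oracle` are hypothetical.

```python
import math

def d_P(P, c1, c2):
    """Pseudo-metric d_P(c1, c2) = P(c1 Δ c2) for a finitely supported P."""
    return sum(p for x, p in P.items() if (x in c1) != (x in c2))

def first_close_index(c, cover, P, eps):
    """Index of the first cover element within eps of c (fixed tie-breaking)."""
    return next(i for i, y in enumerate(cover) if d_P(P, c, y) < eps)

def active_learn(cover, P, eps, oracle):
    """Binary search over cover indices; each oracle call is one yes/no question."""
    lo, hi, asked = 0, len(cover), 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        # Question: does the true concept belong to {c : index(c) < mid}?
        def question(c, mid=mid):
            return first_close_index(c, cover, P, eps) < mid
        asked += 1
        if oracle(question):
            hi = mid
        else:
            lo = mid
    return cover[lo], asked

# Toy setup (assumed): initial segments of X = {0, ..., 7} under uniform P.
P = {x: 1 / 8 for x in range(8)}
cover = [frozenset(range(k)) for k in (1, 4, 7)]  # an eps-cover for eps = 0.25
target = frozenset(range(5))
hypothesis, asked = active_learn(cover, P, 0.25, lambda q: q(target))
print(sorted(hypothesis), asked)
```

With a cover of size $N$, the loop asks at most $\lceil \log_2 N \rceil$ questions, matching the deterministic bound of Theorem 1.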

The essential idea of the argument above is that the learner must be able to encode $N(\epsilon, C, P)$
distinct possibilities and to do so requires $\lceil \log_2 N(\epsilon, C, P) \rceil$ questions. Now, for $\delta > 0$, the learner is
allowed to make a mistake with probability $\delta$. In this case, it is sufficient that the learner be able to
encode $(1 - \delta) N(\epsilon, C, P)$ possibilities since the learner could first randomly select $(1 - \delta) N(\epsilon, C, P)$
concepts from an $\epsilon$-approximation of $N(\epsilon, C, P)$ concepts (each with equal probability) and then
ask questions to select one of the $(1 - \delta) N(\epsilon, C, P)$ concepts, if there is one, that is $\epsilon$-close to the
true concept. To show the lower bound, first note that we can find $N(2\epsilon) = N(2\epsilon, C, P)$ concepts
$c_1, \ldots, c_{N(2\epsilon)}$ which are pairwise at least $2\epsilon$ apart since at least $N(2\epsilon)$ balls of radius $2\epsilon$ are required
to cover $C$. Then the balls $B_\epsilon(c_i)$ of radius $\epsilon$ centered at these $c_i$ are disjoint. For each $i$, if $c_i$ is
the true concept then the learning algorithm must output a hypothesis $h \in B_\epsilon(c_i)$ with probability
greater than $1 - \delta$. Hence, if $k$ queries are asked, then


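$$
\begin{aligned}
(1 - \delta)\, N(2\epsilon, C, P)
  &< \sum_{i=1}^{N(2\epsilon)} \Pr(h \in B_\epsilon(c_i) \mid c = c_i) \\
  &= \sum_{i=1}^{N(2\epsilon)} \int \Pr(h \in B_\epsilon(c_i) \mid c = c_i,\ q_1, \ldots, q_k)\, dA(q_1, \ldots, q_k) \\
  &= \sum_{i=1}^{N(2\epsilon)} \int \sum_{a_1, \ldots, a_k} \Pr(h \in B_\epsilon(c_i) \mid c = c_i,\ (q_1, a_1), \ldots, (q_k, a_k))\, dA(q_1, \ldots, q_k) \\
  &= \int \sum_{a_1, \ldots, a_k} \sum_{i=1}^{N(2\epsilon)} \Pr(h \in B_\epsilon(c_i) \mid c = c_i,\ (q_1, a_1), \ldots, (q_k, a_k))\, dA(q_1, \ldots, q_k) \\
  &= \int \sum_{a_1, \ldots, a_k} \sum_{i=1}^{N(2\epsilon)} \Pr(h \in B_\epsilon(c_i) \mid (q_1, a_1), \ldots, (q_k, a_k))\, dA(q_1, \ldots, q_k) \\
  &\le \int \sum_{a_1, \ldots, a_k} 1 \, dA(q_1, \ldots, q_k) \\
  &= \int 2^k \, dA(q_1, \ldots, q_k) \\
  &= 2^k
\end{aligned}
$$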


where the integral is with respect to any randomness in the questions, the fourth equality (i.e., where
conditioning on $c = c_i$ is dropped) follows from the fact that the hypothesis generated by the learner is
independent of the true concept given the queries and answers, and the second inequality follows
from the fact that the $B_\epsilon(c_i)$ are disjoint. Thus, since $k$ is an integer, $k \ge \lceil \log_2 (1 - \delta) N(2\epsilon, C, P) \rceil$.

Finally, if fewer than $N(\epsilon, C, P)$ possibilities are encoded, then some type of probabilistic
algorithm must necessarily be used, since otherwise there would be some concept which the learner
would always fail to learn to within $\epsilon$. ∎


Thus, compared with passive learning for a fixed distribution, active learning does not enlarge
the set of learnable concept classes, but as expected, fewer queries are required in general. However,
only a factor of $1/\epsilon$, some constants, and a factor of $1/\delta$ in the logarithm are gained by allowing
active learning, which may or may not be significant depending on the behavior of $N(\epsilon, C, P)$ as a
function of $\epsilon$.

Note that in active learning very little is gained by allowing the learner to make mistakes with
probability $\delta$, that is, there is a very weak dependence on $\delta$ in the sample size bounds. In fact for any
$\delta < 1/2$, we have $\log_2 (1 - \delta) N(2\epsilon, C, P) = \log_2 N(2\epsilon, C, P) - \log_2 \frac{1}{1 - \delta} > \log_2 N(2\epsilon, C, P) - 1$,
so that even allowing the learner to make mistakes half the time results in the lower bound differing
from the upper bound and the bound for $\delta = 0$ essentially by only the term $2\epsilon$ versus $\epsilon$ in the metric
entropy. Also, note that Theorem 1 is true for learnability with respect to an arbitrary metric and
not just those induced by probability measures.
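
The weak dependence on $\delta$ can be checked numerically. The sketch below simply evaluates the bounds quoted in this section for hypothetical covering numbers; the function names and the values $N = 1024$, $256$, and $4096$ are illustrative assumptions, not figures from the text, and the formulas are only meaningful when $(1 - \delta) N \ge 1$.

```python
import math

def active_lower(n_2eps, delta):
    """ceil(log2((1 - delta) * N(2 eps))) queries are necessary (Theorem 1)."""
    return math.ceil(math.log2((1 - delta) * n_2eps))

def active_upper(n_eps, delta):
    """ceil(log2((1 - delta) * N(eps))) queries are sufficient (Theorem 1)."""
    return math.ceil(math.log2((1 - delta) * n_eps))

def deterministic_queries(n_eps):
    """ceil(log2 N(eps)) queries are necessary and sufficient when delta = 0."""
    return math.ceil(math.log2(n_eps))

def passive_upper(n_half_eps, eps, delta):
    """(32 / eps) * ln(N(eps / 2) / delta) samples suffice for passive learning [5]."""
    return (32 / eps) * math.log(n_half_eps / delta)

# Hypothetical covering numbers: N(eps) = 1024, N(2 eps) = 256, N(eps / 2) = 4096.
print(deterministic_queries(1024))       # queries with no mistakes allowed
print(active_upper(1024, 0.5))           # queries when mistakes are allowed half the time
print(active_lower(256, 0.5))            # necessary queries at the same delta
print(round(passive_upper(4096, 0.1, 0.5)))  # passive sample bound for comparison
```

Allowing $\delta = 1/2$ saves a single query relative to the deterministic count in this example, whereas the passive bound carries the additional $1/\epsilon$ factor.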
4  Distribution-Free  Active  Learning


Distribution-free learning (active or passive) corresponds to the case where $\mathcal{P}$ is the set of all
probability measures $\mathcal{P}^*$ over, say, the Borel $\sigma$-algebra. A fundamental result of Blumer et al. [6]
relates  passive  learnability  for  all  distributions  (i.e.,  distribution-free)  to  the  Vapnik-Chervonenkis 
(VC)  dimension  of  the  concept  class  to  be  learned. 

Definition 4 (Vapnik-Chervonenkis Dimension) Let $C \subseteq 2^X$. For any finite set $S \subseteq X$, let
$\Pi_C(S) = \{S \cap c : c \in C\}$. $S$ is said to be shattered by $C$ if $\Pi_C(S) = 2^S$. The Vapnik-Chervonenkis
dimension of $C$ is defined to be the largest integer $d$ for which there exists a set $S \subseteq X$ of cardinality
$d$ such that $S$ is shattered by $C$. If no such largest integer exists then the VC dimension of $C$ is
infinite.
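
For a finite instance space and a finite, nonempty concept class, Definition 4 can be checked by brute force. The sketch below is an illustration under those assumptions (concepts as frozensets; the names `is_shattered` and `vc_dimension` are hypothetical), and its running time is exponential in the size of $X$.

```python
from itertools import combinations

def is_shattered(S, concepts):
    """S is shattered if the traces {S ∩ c : c in C} realize all 2^|S| subsets."""
    S = frozenset(S)
    traces = {S & c for c in concepts}
    return len(traces) == 2 ** len(S)

def vc_dimension(X, concepts):
    """Largest d such that some d-point subset of the finite set X is shattered."""
    d = 0
    for size in range(1, len(X) + 1):
        if any(is_shattered(S, concepts) for S in combinations(X, size)):
            d = size
        else:
            break  # subsets of shattered sets are shattered, so larger sizes fail too
    return d

# Toy classes (assumed) on X = {0, 1, 2, 3}: intervals and initial segments.
X = list(range(4))
intervals = {frozenset(range(a, b)) for a in range(5) for b in range(a, 5)}
segments = {frozenset(range(k)) for k in range(5)}
print(vc_dimension(X, intervals), vc_dimension(X, segments))
```

Intervals shatter any two points but no three (the outer two cannot be selected without the middle one), while initial segments shatter only single points.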

Blumer  et  al.  [6]  proved  that  a  concept  class  C  is  learnable  for  all  distributions  iff  C  has  finite 
VC  dimension,  and  they  provided  upper  and  lower  bounds  on  the  number  of  samples  required. 
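Specifically, if $C$ has VC dimension $d < \infty$ (and satisfies certain measurability conditions that
we will not concern ourselves with) they showed that $\max\left(\frac{1-\epsilon}{\epsilon} \log \frac{1}{\delta},\ d(1 - 2(\epsilon + \delta - \epsilon\delta))\right)$ samples
are necessary and $\max\left(\frac{4}{\epsilon} \log \frac{2}{\delta},\ \frac{8d}{\epsilon} \log \frac{13}{\epsilon}\right)$ samples are sufficient, although since their work some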
refinements  have  been  made  in  these  bounds. 

The  case  of  distribution-free  active  learnability  is  a  little  more  subtle  than  active  learnability 
for  a  fixed  distribution.  For  both  active  and  passive  learning,  the  requirement  that  the  learning  be 
distribution-free  imposes  two  difficulties.  The  first  is  that  there  must  exist  a  uniform  bound  on  the 
number  of  examples  or  queries  over  all  distributions  —  i.e.,  a  bound  independent  of  the  underlying 
distribution. The second is that the distribution is unknown to the learner, so that the learner
does  not  know  how  to  evaluate  distances  between  concepts.  Hence,  since  the  metric  is  unknown, 
the  learner  cannot  simply  replace  the  concept  class  with  a  finite  e-approximation  as  in  the  case  of 
a  fixed  and  known  distribution. 

For  passive  learnability,  the  requirement  that  the  concept  class  have  finite  VC  dimension  is 
necessary  and  sufficient  to  overcome  both  of  these  difficulties.  However,  for  active  learning  the 
second  difficulty  is  severe  enough  that  no  learning  can  take  place  as  long  as  the  concept  class  is 
infinite. 

Lemma 1 Let $C$ be an infinite set of concepts. If $c_1, \ldots, c_n \in C$ is any finite set of concepts in $C$
then there exists $c_{n+1} \in C$ and a distribution $P$ such that $d_P(c_{n+1}, c_i) \ge 1/2$ for $i = 1, \ldots, n$.

Proof: Consider all sets of the form $b_1 \cap b_2 \cap \cdots \cap b_n$ where $b_i$ is either $c_i$ or its complement $\bar{c}_i$. There are at most
$2^n$ distinct sets $B_1, \ldots, B_{2^n}$ of this form. Note that the $B_i$ are disjoint, their union is $X$, and each
$c_i$ for $i = 1, \ldots, n$ consists of a union of certain $B_i$. Since $C$ is infinite and only finitely many sets are unions of the $B_i$, there is a set $c_{n+1} \in C$ such
that for some nonempty $B_k$, $c_{n+1} \cap B_k$ is nonempty and $c_{n+1} \cap B_k \ne B_k$. Hence, there exist points
$x_1, x_2 \in X$ with $x_1 \in c_{n+1} \cap B_k$ and $x_2 \in B_k \setminus c_{n+1}$. Let $P$ be the probability measure which assigns
probability 1/2 to $x_1$ and 1/2 to $x_2$. For each $i = 1, \ldots, n$, either $B_k \subseteq c_i$ or $B_k \cap c_i = \emptyset$. Thus, in
either case $c_{n+1} \Delta c_i$ contains exactly one of $x_1$ or $x_2$ so that $d_P(c_{n+1}, c_i) = 1/2$ for $i = 1, \ldots, n$. ∎
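
The construction in the proof of Lemma 1 can be traced on a finite instance space. The sketch below is an illustration only: it groups the points of $X$ by which of $c_1, \ldots, c_n$ contain them (the sets $B_k$), looks for a group that the new concept splits, and returns the two-point distribution. The names `two_point_distribution` and `d_P` and the toy sets are assumptions; the lemma itself does not require $X$ to be finite.

```python
def d_P(P, c1, c2):
    """Pseudo-metric d_P(c1, c2) = P(c1 Δ c2) for a finitely supported P."""
    return sum(p for x, p in P.items() if (x in c1) != (x in c2))

def two_point_distribution(X, cs, c_new):
    """Return P with mass 1/2 on x1 and x2 from one set B_k that c_new splits."""
    atoms = {}
    for x in X:
        signature = tuple(x in c for c in cs)  # which of c_1, ..., c_n contain x
        atoms.setdefault(signature, []).append(x)
    for members in atoms.values():
        inside = [x for x in members if x in c_new]
        outside = [x for x in members if x not in c_new]
        if inside and outside:
            return {inside[0]: 0.5, outside[0]: 0.5}
    return None  # c_new is a union of the B_k, so this construction does not apply

# Toy example (assumed): X = {0, ..., 5}, two existing concepts, one new concept.
X = range(6)
cs = [frozenset({0, 1, 2}), frozenset({2, 3})]
c_new = frozenset({0, 4})
P = two_point_distribution(X, cs, c_new)
print(P, [d_P(P, c_new, c) for c in cs])
```

Here the set $B_k = \{0, 1\}$ is split by the new concept, and every existing concept either contains both points or neither, so each distance equals 1/2.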


Theorem  2  C  is  actively  learnable  for  all  distributions  iff  C  is  finite. 
Proof: If $C$ is finite it is clearly actively learnable since the learner need only ask $\lceil \log_2 |C| \rceil$
questions where $|C|$ is the cardinality of $C$ to decide which concept is the correct one.

If  C  is  infinite  we  will  show  that  C  is  not  actively  learnable  by  showing  that  after  finitely  many 
questions  an  adversary  could  give  answers  so  that  there  are  still  infinitely  many  candidate  concepts 
which  are  far  apart  under  infinitely  many  remaining  probability  distributions.  Since  C  is  infinite, 
we can repeatedly apply the lemma above to obtain an infinite sequence of concepts $c_1, c_2, \ldots$ and
an associated sequence of probability measures $P_1, P_2, \ldots$ such that under the distribution $P_i$, the
concept $c_i$ is a distance 1/2 away from all preceding concepts. I.e., for each $i$, $d_{P_i}(c_i, c_j) = 1/2$ for
$j = 1, \ldots, i - 1$.

Now, any question that the active learner can ask is of the form "Is $(c, P) \in q$?" where $q$
is a subset of $C \times \mathcal{P}$. Consider the pairs $(c_1, P_1), (c_2, P_2), \ldots$. Either $q$ or its complement $\bar{q}$ (or both) contain
an infinite number of the pairs $(c_i, P_i)$. Thus, an adversary could always give an answer such
that an infinite number of pairs $(c_i, P_i)$ remain as candidates for the true concept and probability
measure. Similarly, after any finite number of questions an infinite number of $(c_i, P_i)$ pairs remain
as candidates. Thus, by the property that $d_{P_i}(c_i, c_j) = 1/2$ for $j = 1, \ldots, i - 1$, it follows that for
any $\epsilon < 1/2$ the active learner cannot learn the target concept. ∎
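
The adversary argument cannot be run on an infinite sequence, but its mechanism can be illustrated on a finite truncation. In the sketch below, integers stand in for the pairs $(c_i, P_i)$ and the adversary always gives the answer consistent with the larger group of remaining candidates, so at least half survive each question; in the infinite case the same choice keeps infinitely many. The name `adversary_answer` and the sample questions are hypothetical.

```python
def adversary_answer(candidates, question):
    """Answer yes/no so that the larger group of candidate pairs stays consistent."""
    yes = [p for p in candidates if question(p)]
    no = [p for p in candidates if not question(p)]
    return (True, yes) if len(yes) >= len(no) else (False, no)

# Finite stand-in (assumed): indices i = 0, ..., 1023 represent the pairs (c_i, P_i).
candidates = list(range(1024))
questions = [lambda i: i % 3 == 0, lambda i: i < 700, lambda i: i % 2 == 1]
for q in questions:
    answer, candidates = adversary_answer(candidates, q)
print(len(candidates))
```

After $k$ questions at least a fraction $2^{-k}$ of the original candidates remains, whatever questions the learner chooses.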

Essentially,  if  the  distribution  is  unknown,  then  the  active  learner  has  no  idea  about  ‘where’ 
to  seek  information  about  the  concept.  On  the  other  hand,  in  passive  learnability  the  examples 
are  provided  according  to  the  underlying  distribution,  so  that  information  is  obtained  in  regions  of 
importance.  Hence,  in  the  distribution-free  case,  random  samples  (from  the  distribution  used  to 
evaluate performance) are indispensable.

Suppose  that  we  remove  the  second  difficulty  by  assuming  that  the  learner  has  knowledge  of 
the  underlying  distribution.  Then  the  learner  knows  the  metric  being  used  and  so  can  form  a  finite 
approximation  to  the  concept  class.  In  this  case,  the  distribution-free  requirement  plays  a  part 
only in forcing a uniform bound on the number of queries needed. Certainly, the active learner can
learn any concept class that is learnable by a passive learner since the active learner could simply
ask queries according to the known distribution to simulate a passive learner. However, the following
theorem  shows  that  active  learning,  even  with  the  side  information  as  to  the  distribution  being 
used,  does  not  enlarge  the  set  of  learnable  concept  classes. 

Theorem  3  If  the  learner  knows  the  underlying  probability  distribution  then  C  is  actively  learnable 
for all distributions iff $C$ has finite VC dimension. Furthermore, $\lceil \sup_P \log_2 (1 - \delta) N(2\epsilon, C, P) \rceil$
questions are necessary and $\lceil \sup_P \log_2 (1 - \delta) N(\epsilon, C, P) \rceil$ questions are sufficient. For deterministic
algorithms $\lceil \sup_P \log_2 N(\epsilon, C, P) \rceil$ questions are both necessary and sufficient.

Proof: If the distribution is known to the learner, then the result of Theorem 1 applies for each
distribution. Learnability for all distributions then simply imposes the uniform (upper and lower)
bounds requiring the supremum over all distributions for both general (i.e., probabilistic) active
learning algorithms and for deterministic algorithms. For the first part of the theorem, we need
the following result relating the VC dimension of a concept class to its metric entropy: the VC
dimension of $C$ is finite iff $\sup_P N(\epsilon, C, P) < \infty$ for all $\epsilon > 0$ (e.g., see [5] or [13] and references
therein). The first part of the theorem follows immediately from this result. ∎
Thus, even with this extra 'side' information, the set of learnable concept classes is not enlarged
by  allowing  an  active  learner.  However,  as  before  one  would  expect  an  improvement  in  the  number 
of  samples  required.  A  direct  comparison  is  not  immediate  since  the  bounds  for  passive  learnability 
involve the VC dimension, while the results above are in terms of the metric entropy. A comparison
can be made using bounds relating the VC dimension of a concept class to its metric entropy with
respect to various distributions, which provide upper and lower bounds to $\sup_P N(\epsilon, C, P)$. Upper
bounds are more difficult to obtain since these require a uniform bound on the metric entropy
over all distributions. The lower bounds result from statements of the form that there exists a
distribution $P$ (typically a uniform distribution over some finite set of points) for which $N(\epsilon, C, P)$
is greater than some function of the VC dimension. However, most previous lower bounds are not
particularly useful for small $\epsilon$, i.e., the bounds remain finite as $\epsilon \to 0$. This is the best that can
be obtained assuming only that $C$ has VC dimension $d < \infty$, since $C$ itself could be finite. The
following result assumes that $C$ is infinite but makes no assumption about the VC dimension of $C$.

Lemma 2 Let C be a concept class with an infinite number of distinct concepts. Then for each
$\epsilon > 0$ there is a probability distribution P such that $N(\epsilon, C, P) \ge \lfloor 1/(2\epsilon) \rfloor$.

Proof: First, we show by induction that given n distinct concepts, n - 1 points $x_1, \ldots, x_{n-1}$ can
be found such that the n concepts give rise to distinct subsets of $\{x_1, \ldots, x_{n-1}\}$. This is clearly true for
n = 2. Suppose it is true for n = k. Then for n = k + 1 concepts $c_1, \ldots, c_{k+1}$ apply the induction
hypothesis to $c_1, \ldots, c_k$ to get $x_1, \ldots, x_{k-1}$ which distinguish $c_1, \ldots, c_k$. On these points, $c_{k+1}$ can agree with at
most one of $c_1, \ldots, c_k$. If it does, another point $x_k$ can be chosen to distinguish these two.

Now, let $\epsilon > 0$ and set $n = \lfloor 1/(2\epsilon) \rfloor$. Let $c_1, \ldots, c_n$ be n distinct concepts in C, and let $x_1, \ldots, x_{n-1}$
be n - 1 points that distinguish $c_1, \ldots, c_n$. Let P be the uniform distribution on $x_1, \ldots, x_{n-1}$. Since
the $c_i$ are distinguished by the $x_i$, $d_P(c_i, c_j) \ge 1/(n-1) = 1/(\lfloor 1/(2\epsilon) \rfloor - 1) > 2\epsilon$ for $i \ne j$. Hence, every concept
is within $\epsilon$ of at most one of $c_1, \ldots, c_n$, so that $N(\epsilon, C, P) \ge n = \lfloor 1/(2\epsilon) \rfloor$. ∎
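
The construction in this proof can be sketched in a few lines, under the assumption that concepts are represented as finite Python sets of points. The helper names `distinguishing_points` and `distance` are hypothetical, and the toy class of initial segments of the integers is chosen only for illustration. The greedy loop mirrors the induction step: a new point is added only when the incoming concept agrees with an earlier one on all points chosen so far, so at most n - 1 points are ever used.

```python
from itertools import combinations
from math import floor

def distinguishing_points(concepts):
    """Return at most len(concepts) - 1 points on which the given distinct
    concepts (sets) all have distinct traces, following the induction."""
    points = []

    def trace(s):
        # Membership pattern of concept s on the points chosen so far.
        return tuple(x in s for x in points)

    seen = []                                    # concepts already separated
    for c in concepts:
        clash = [s for s in seen if trace(s) == trace(c)]
        if clash:                                # at most one earlier concept can clash
            # Any point of the symmetric difference separates the pair.
            points.append(next(iter(c ^ clash[0])))
        seen.append(c)
    return points

def distance(c1, c2, points):
    """d_P(c1, c2) for P uniform on `points`: the fraction of points
    lying in the symmetric difference of c1 and c2."""
    return sum((x in c1) != (x in c2) for x in points) / len(points)

# Toy usage: n initial segments {0, ..., k-1}, with n = floor(1/(2*eps)).
eps = 0.1
n = floor(1 / (2 * eps))
concepts = [frozenset(range(k)) for k in range(n)]
pts = distinguishing_points(concepts)
assert len(pts) <= n - 1
# Pairwise distances exceed 2*eps, so no eps-ball contains two of the concepts.
assert all(distance(a, b, pts) > 2 * eps for a, b in combinations(concepts, 2))
```

If the greedy loop happens to use fewer than n - 1 points, the uniform distribution on them only makes the pairwise distances larger, so the separation argument is unaffected.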

The following theorem summarizes the result of the lemma and previous upper and lower bounds
obtained by others.

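Theorem 4 Let C be a concept class with infinitely many concepts and let $1 \le d < \infty$ be the VC
dimension of C. For $\epsilon < 1/4$,

$$\sup_P \log_2 N(\epsilon, C, P) \ge \max\left( 2d(1/2 - 2\epsilon)^2 \log_2 e,\ \log_2 \lfloor 1/(2\epsilon) \rfloor \right)$$

and for $\epsilon < 1/2d$,

$$\sup_P \log_2 N(\epsilon, C, P) \le d \log_2\left( \frac{2e}{\epsilon} \ln \frac{2e}{\epsilon} \right) + 1$$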

Proof:  The  first  term  of  the  lower  bound  is  from  [13]  and  the  second  term  of  the  lower  bound 
follows  from  Lemma  2.  The  upper  bound  is  from  [10]  which  is  a  refinement  of  a  result  from  [16] 
using techniques originally from [7]. A weaker upper bound was also given in [5]. ∎

This theorem gives bounds on the number of questions needed in distribution-free active learning
(with the side information) directly in terms of $\epsilon$, $\delta$ and the VC dimension of C. As stated, the
bounds are directly applicable to deterministic active learning algorithms or for active learning with
$\delta = 0$. For probabilistic algorithms with $\delta > 0$ the quantity $\log_2 \frac{1}{1-\delta}$ needs to be subtracted
from both the lower and upper bounds.
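
For a feel of the magnitudes involved, the following sketch evaluates the two bounds numerically. It assumes they read as $\max(2d(1/2 - 2\epsilon)^2 \log_2 e,\ \log_2 \lfloor 1/(2\epsilon) \rfloor)$ from below and $d \log_2(\frac{2e}{\epsilon} \ln \frac{2e}{\epsilon}) + 1$ from above, and it applies the $\log_2 \frac{1}{1-\delta}$ shift exactly as described in the preceding paragraph. The function name `question_bounds` is hypothetical, and the returned values are raw (not rounded up to whole questions).

```python
import math

def question_bounds(eps, delta, d):
    """Return (lower, upper) bounds, in bits, on the number of questions.

    Assumptions (illustrative reading of the bounds, not the authors' code):
      lower = max(2 d (1/2 - 2 eps)^2 log2(e), log2 floor(1/(2 eps)))
      upper = d log2((2e/eps) ln(2e/eps)) + 1
    Both are shifted down by log2(1/(1 - delta)); delta = 0 gives no shift.
    """
    if not (0 < eps < 0.25 and eps < 1 / (2 * d)):
        raise ValueError("requires 0 < eps < 1/4 and eps < 1/(2d)")
    if not 0 <= delta < 1:
        raise ValueError("requires 0 <= delta < 1")

    shift = math.log2(1 / (1 - delta))

    # First term: VC-dimension-based lower bound; second term: Lemma 2.
    lower = max(2 * d * (0.5 - 2 * eps) ** 2 * math.log2(math.e),
                math.log2(math.floor(1 / (2 * eps))))

    t = 2 * math.e / eps
    upper = d * math.log2(t * math.log(t)) + 1

    return lower - shift, upper - shift


# Toy usage: eps = 0.01, VC dimension 3, deterministic case (delta = 0).
lo, hi = question_bounds(0.01, 0.0, 3)
assert lo <= hi
```

With $\delta = 1/2$ the shift is exactly one bit, which shows how little the confidence parameter affects the count compared with $\epsilon$ and d.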


5  Discussion 

In this paper we considered the effect on PAC learnability of allowing a rich set of
learner-environment interactions. Previous work along these lines has provided the learner with access
to  various  types  of  oracles.  Many  of  the  oracles  considered  in  the  literature  answer  queries  which 
are  special  cases  of  yes/no  questions  (although  those  oracles  that  provide  counterexamples  are  not 
of  this  type).  As  expected,  the  use  of  oracles  can  often  aid  in  the  learning  process.  To  understand 
the  limits  of  how  much  could  be  gained  through  oracles,  we  have  considered  an  active  learning 
model  in  which  the  learner  chooses  the  information  received  by  asking  arbitrary  yes/no  questions 
about  the  unknown  concept  and/or  probability  distribution. 

For  a  fixed  distribution,  active  learning  does  not  enlarge  the  set  of  learnable  concept  classes,  but 
it  does  have  lower  sample  complexity  than  passive  learning.  For  distribution-free  active  learning, 
the  set  of  learnable  concept  classes  is  drastically  reduced  to  the  degenerate  case  of  finite  concept 
classes.  Furthermore,  even  if  the  learner  is  told  the  distribution  but  is  still  required  to  learn 
uniformly  over  all  distributions,  a  concept  class  is  actively  learnable  iff  it  has  finite  VC  dimension. 

For  completeness,  we  mention  that  results  can  also  be  obtained  if  the  learner  is  provided  with 
‘noisy’  answers  to  the  queries.  The  effects  of  various  types  of  noise  in  passive  learning  have  been 
studied  [2,  11,  18].  For  active  learning,  two  natural  noise  models  are  random  noise  in  which 
the answer to a query is incorrect with some probability $\eta < 1/2$ independent of other queries,
and  malicious  noise  in  which  an  adversary  gets  to  choose  a  certain  number  of  queries  to  receive 
incorrect  answers.  For  random  noise,  the  problem  is  equivalent  to  communication  through  a  binary 
symmetric  channel,  so  that  standard  results  from  information  theory  on  the  capacity  and  coding  for 
such channels [9] can be applied. For malicious noise, some results on binary searching with these
types  of  errors  [17]  can  be  applied.  For  both  noise  models,  the  conditions  for  fixed  distribution  and 
distribution-free  learnability  are  the  same  as  the  noise-free  case,  but  with  a  larger  sample  complexity. 
However,  the  more  interesting  aspects  of  our  results  are  the  indications  of  the  limitations  of  active 
learning,  and  the  noise-free  case  makes  stronger  negative  statements. 
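
The random-noise analogy can be made concrete with a rough sketch. The capacity of a binary symmetric channel with crossover probability $\eta$ is the standard $1 - H(\eta)$ bits per use, where H is the binary entropy function. Dividing a noise-free query count by that capacity gives only a heuristic, asymptotic estimate of the inflated query count; it is an assumption made here for illustration, not a bound stated in this paper, and it ignores finite-length coding and confidence effects. Both function names are hypothetical.

```python
import math

def bsc_capacity(eta):
    """Capacity (bits per use) of a binary symmetric channel
    with crossover probability eta: 1 - H(eta)."""
    if not 0 <= eta <= 1:
        raise ValueError("eta must lie in [0, 1]")
    if eta in (0.0, 1.0):
        return 1.0                      # noiseless (or deterministically flipped) answers
    entropy = -eta * math.log2(eta) - (1 - eta) * math.log2(1 - eta)
    return 1.0 - entropy

def noisy_query_estimate(noise_free_queries, eta):
    """Heuristic estimate: noise-free query count divided by channel capacity."""
    capacity = bsc_capacity(eta)
    if capacity <= 0:
        raise ValueError("eta = 1/2 makes the answers uninformative")
    return math.ceil(noise_free_queries / capacity)


# Toy usage: 10 noise-free questions, answers flipped with probability 0.11.
estimate = noisy_query_estimate(10, 0.11)
assert estimate > 10
```

The estimate grows without bound as $\eta$ approaches 1/2, in line with the requirement $\eta < 1/2$ in the random noise model.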

Finally,  an  open  problem  that  may  be  interesting  to  pursue  is  to  study  the  reduction  in  sample 
complexity  of  distribution-free  learning  if  the  learner  has  access  to  both  random  examples  and 
arbitrary  yes/no  questions.  This  is  similar  to  the  problem  considered  in  [8],  but  there  the  learner 
could  choose  only  examples  to  be  labeled  rather  than  ask  arbitrary  questions.  Our  result  for  the  case 
where  the  learner  knows  the  distribution  being  used  provides  a  lower  bound,  but  if  the  distribution 
is not known then we expect that for certain concept classes much stronger lower bounds would
hold.  In  particular,  we  conjecture  that  results  analogous  to  those  in  [8]  hold  in  the  case  of  arbitrary 
binary  valued  questions,  so  that,  for  example,  asking  yes/no  questions  could  reduce  the  sample 
complexity  to  learn  a  dense-in-itself  concept  class  by  only  a  constant  factor. 